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ABSTRACT 


The  implementation  and  performance  of  a  parallel  spatial  direct  numerical  simulation 
(PSDNS)  approach  on  the  Intel  iPSC/860  hypercube  is  documented.  The  direct  numerical 
simulation  approach  is  used  to  compute  spatially  evolving  disturbances  associated  with  the 
laminar-to-turbulent  transition  in  boundary-layer  flows.  The  feasibility  of  using  the  PSDNS 
on  the  hypercube  to  perform  transition  studies  is  examined.  The  results  indicate  that  the 
DNS  approach  can  effectively  be  parallelized  on  a  distributed-memory  parallel  machine.  By 
increasing  the  number  of  processors,  nearly  ideal  linear  speedups  are  achieved  with  nonopti- 
mized  routines;  slower  than  linear  speedups  are  achieved  with  optimized  (machine-dependent 
library)  routines.  This  slower  than  linear  speedup  results  because  the  FFT  routine  domi¬ 
nates  the  computational  cost  and  because  the  FFT  routine  indicates  less  than  ideal  speedups. 
However,  with  the  machine- dependent  routines,  the  total  computational  cost  decreases  by  a 
factor  of  4  to  5  compared  with  standard  Fortran  routines.  The  computational  cost  increases 
linearly  with  spanwise,  wall-normal,  and  streamwise  grid  refinements.  The  hypercube  with 
32  processors  was  estimated  to  require  approximately  twice  the  amount  of  Cray  supercom¬ 
puter  single  processor  time  to  complete  a  comparable  simulation;  however,  it  is  estimated 
that  a  subgrid-scale  model,  which  reduces  the  required  number  of  grid  points  and  becomes 
a  large-eddy  simulation  (PSLES),  would  reduce  the  computational  cost  and  memory  re¬ 
quirements  by  a  factor  of  10  over  the  PSDNS.  This  PSLES  implementation  would  enable 
transition  simulations  on  the  hypercube  at  a  reasonable  computational  cost.  - - - - - — 
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1.  INTRODUCTION 


The  state  of  the  three-dimensional  boundary-layer  flow  on  the  wings  and  fuselage  of  an 
aircraft  determines  the  viscous  drag  portion  of  the  total  drag  of  the  aircraft.  This  viscous 
drag,  which  is  flow-state  dependent,  can  amount  to  40  or  50  percent  of  the  total  drag.  (See 
Bushnell  et  al.,  1977.)  Any  decrease  in  the  viscous  drag  can  lead  to  reduced  fuel  expenditures. 
This  fuel  savings  can  translate  directly  into  reduced  operating  costs  each  year  for  the  industry 
in  terms  of  millions  (if  not  billions)  of  dollars.  The  flow  field  on  a  wing  can  be  in  a  lamineir, 
turbulent,  or  transitional  (an  intermediate  state  that  indicates  transition  from  a  laminar  to 
a  turbulent  flow  state)  state.  Because  a  laminar  flow  state  yields  less  viscous  drag  than  a 
turbulent  flow  state,  laminar  flow  on  the  wings  is  preferable  and  results  in  a  net  fuel  savings. 
Today,  turbulent  flows  engulf  most  of  the  wing  area  of  commercial  aircraft.  Clearly,  any 
aircraft  manufacturer  that  successfully  designs  an  aircraft  with  “laminar  flow  wings”  (i.e., 
V  ings  covered  primarily  by  laminar  flow)  will  have  an  enormous  advantage. 

As  yet,  the  transition  from  laminar  to  turbulent  flow  is  not  completely  understood.  The 
first  reasonably  comprehensive  method  for  predicting  trEinsition  was  derived  from  stability 
theory,  which  is  the  method  by  Smith  and  Gamberoni  (1956)  and  Van  Ingen  (1956). 
Although  the  method  is  widely  used  to  predict  transition  in  a  broad  class  of  flows,  it  does 
have  some  limitations:  a  quasi-parallel  boundary  layer  is  assumed;  no  amplitude  information 
about  the  ingested  disturbance  in  the  boundary  layer  is  taken  into  account;  and  the  method 
is  semiempirical,  which  requires  some  foreknowledge  of  the  flow  in  transition.  The  true 
physical  problem  involves  the  ingestion  of  disturbances  that  interact  in  a  nonlinear  manner 
in  the  later  stages  of  transition  and  are  imbedded  in  a  growing  boundary  layer.  Consequently, 
a  method  that  accounts  for  nonparallel  flow  and  nonlinear  interactions  is  necessary  to  predict 
transition. 

Recently,  Herbert  and  Bertolotti  (1987)  have  devised  a  nonlinear,  nonparallel  computa¬ 
tional  method  that  is  based  on  the  parabolized  stability  equations  (PSE).  With  some  success, 
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Malik  and  Li  (1992)  have  extended  the  PSE  approach  to  eoinpnte  crossflow  disturbances  in 
swept  Hiemenz  flow.  Validation  of  this  new  approach  for  a  broad  class  of  flows  will  con¬ 
tinue  throughout  this  decade.  Before  the  development  of  this  theory,  the  only  approach  to 
solve  the  nonparallel,  nonlinear  boundary-layer  transition  problem  was  by  direct  numerical 
simulation  (DNS).  To  date,  most  studies  with  DNS  have  been  limited  to  the  temporal  for¬ 
mulation,  in  which  a  spatially  periodic  computational  domain  travels  with  th('  disturbance 
and  the  temporal  evolution  of  the  disturbance  is  computed.  This  method  has  enabled  the 
exteiLsion  of  the  simulations  into  the  later  stages  of  transition  (Zang  and  Hussaini.  1987. 
1990;  and  Laurien  and  Kleiser,  1989)  and  has  provided  a  database  of  qualitative  information 
that  unfortunately  lacks  the  physically  realistic  spatial  representation.  Spatial  DNS  com¬ 
putes  spatially  evolving  disturbances  and  can  provide  needed  quantitative  information  about 
transition.  Progress  in  spatial  DNS  has  been  made  by,  among  others,  Danabasoglu  et  al. 

( 1990,  1991)  for  channel  flows;  Easel  ( 1976),  Spalart  (1989),  Easel  et  al.  (1990),  Rai  and  Moin 
(1991a,  1991b),  Bestek  et  al.  (1992),  and  Josiinet  al.  (1992,  1993a)  for  boundary-layer  flows; 
and  Joslin  et  al.  (1993b)  for  swept-wing  flows.  Eor  a  more  complete  list  of  accomplishments 
in  transition  prediction  with  DNS,  refer  to  the  reviews  by  Kleiser  and  Zang  (1991)  and  Reed 
(1993).  Enormous  speed  and  memory  reciuirements  are  necessary  for  spatial  DNS  because 
of  the  large  domains  and  intensive  computations  that  are  involved. 

Machines  that  can  process  large  amounts  of  data  at  faster  speeds  are  in  ever  increas¬ 
ing  demand.  Two  possibilities  exist  for  achieving  high  computational  speeds:  technological 
advancements  and  parallel  computations.  Technological  advancements  alone  will  not  pro¬ 
vide  the  desired  computational  speed  because  certain  intrinsic  physical  limitations  are  being 
reached.  An  important  limitation  is  the  cycle  speed,  wdiich  is  governed  by  the  propagation 
speed  of  the  signal  in  the  given  media.  Eor  example,  the  Cray  1  (d'^Hvered  in  1976)  had 
a  cycle  time  of  12.5  nsec,  and  the  Cray  2  (delivered  in  1987)  had  a  cycle  time  of  4.1  nsec. 
Although  11  years  elapsed,  an  improvement  of  only  a  factor  ot  3  m  processor  speed  has  been 


2 


achieved.  Parallel  computation  is  a  more  attractive  alternate  approach  because  the  cost  and 
size  of  computer  components  decrease  by  an  order  of  magnitude  compared  with  Cray-type 
supercomputers,  with  only  an  incremental  decrease  in  component  speed.  The  real  advantage 
to  parallel  computing  is  the  increase  in  computational  speed  that  occurs  as  the  number  of 
processors  are  increased. 

A  large  body  of  literature  covers  the  treatment  of  numerical  algorithms  for  vector  and 
parallel  computers.  (See  Ortega  and  Voigt,  1988.)  In  most  cases,  the  numerical  treatments 
have  focused  on  simplified  problems.  In  other  cases,  algorithms  for  computationally  intensive 
kernels  in  isolation  from  the  entire  application  have  been  studied.  These  studies  are  all 
necessary;  however,  an  efficient  scheme  for  a  kernel  does  not  necessarily  result  in  the  efficient 
implementation  of  the  whole  scientific  application.  Typically,  an  entire  application  consists 
of  a  number  of  kernels  with  different  data-distribution  requirements,  which  necessitates  data 
movement  between  two  kernels.  This  movement  can  considerably  degrade  the  performance 
of  the  entire  application  on  a  parallel  machine. 

The  exploitation  of  parallelism  for  real-world  computations  becomes  an  even  greater 
challenge  in  the  absence  of  tools  that  transform  sequential  codes  into  parallel  codes.  A 
number  of  issues  must  be  considered  in  the  implementation  of  an  entire  application  on  a 
parallel  computer.  Some  of  these  issues  include 

Data  Mapping.  The  data  distribution  among  the  various  processors  of  a  parallel  ma¬ 
chine  is  a  key  factor  in  the  efficiency  of  a  parallel  implementation.  For  many  scientific 
applications,  the  optimal  data  distribution  is  not  obvious  and  requires  experimentation. 

Communication  Requirement.  The  choice  of  the  algorithm  and  the  data  distri¬ 
butions  determines  the  communication  requirements  of  an  application.  Two  factors  are 
treated  separately;  the  communication  volume  and  the  frequency  of  communication.  Fre¬ 
quent  interprocessor  communication  is  not  desirable  on  parallel  machines  because  of  the  high 
communication-setup  overheads.  On  such  machines,  large  messages  and  infrequent  commu- 
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nication  are  preferable. 


Problem  Granularity.  For  a  given  number  of  processors,  a  problem  granularity  ex¬ 
ists  below  which  parallelization  is  not  effective.  The  problem  granularity  depends  on  the 
hardware  and  software  characteristics  of  the  parallel  machine. 

Single  Node  Performance.  The  efficient  implementation  of  a  code  on  a  single  node  is 
important  because  any  improvement  in  the  computational  performance  on  a  single  processor 
will  have  a  multiplicative  effect  on  the  overall  parallel  performance. 

In  summary,  the  whole  scientific  application  needs  to  be  implemented  carefully  to  obtain 
desirable  performance  on  parallel  machines. 

Scientific  applications  can  be  clearly  categorized  according  to  their  suitability  for  paral¬ 
lel  implementation;  however,  many  scientific  and  engineering  applications  fall  into  grey  areas 
where  the  suitability  of  parallelization  for  the  application  is  not  clear.  In  most  cases,  the 
application  must  be  implemented  and  tested  to  determine  its  suitability  for  parallel  computa¬ 
tions.  These  applications  include  a  variety  of  diverse  numerical  approaches.  Some  examples 
of  these  applications  are:  Fischer  et  al.  (1988)  and  Henderson  and  Karniadakis  (1991),  who 
discussed  the  use  of  spectral  element  methods  to  characterize  the  unsteady  Navier-Stokes 
equations;  Otto  (1993),  who  studied  chemical  reactions  in  a  computational  fluid  dynamics 
code;  Jackson  et  al.  (1991),  who  studied  incompressible  turbulence  with  a  temporal  DNS 
code;  and  Edison  and  Edebacher  (1993),  who  used  a  fully  balanced  tridiagonal  solver  (all 
three  directions)  in  a  temporal  DNS  code  to  study  compressible,  isotropic  turbulence. 

The  goal  of  the  present  research  is  to  modify  a  spatial  DNS  approach  described  by  Joslin 
et  al.  (1993a)  to  perform  boundary-layer  transition  computations  on  parallel  computers.  The 
suitability  of  the  DNS  code  for  parallelization  on  a  distributed-memory  parallel  machine,  the 
Intel  iPSC/860,  is  examined. 
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2.  GOVERNING  EQUATIONS 


To  compute  the  disturbance  development,  the  incompressible  Navier-Stokes  equations 
are  solved.  The  streamwise  direction  is  x,  the  wall- normal  direction  is  y,  and  the  spajiwise 
direction  is  z.  A  sketch  of  the  computational  domain  is  shown  in  Fig.  1.  Instantaneous  ve¬ 
locities  u  =  (u,  u,  w)  and  pressure  p  are  decomposed  into  the  base  components  U_  =  {U,  V,  W) 
and  P  and  the  disturbance  components  u  =  (u,v,w)  and  p  so  that 

u{x,  t)  =  U_{x)  -t-  u{^,  t)  and  p(x,  t)  =  P(x)  +  p{x,  t)  (1) 


where  x  =  (x,  y,  z)  and  t  is  time. 

The  base  flow  is  generally  the  steady-state  solution  of  the  Navier-Stokes  equations.  For 
simplicity,  this  study  will  use  the  Blasius  similarity  profiles  for  the  base  flow  of  a  flat-plate 
transition  problem. 

To  determine  the  disturbance  component  of  the  instantaneous  velocities  and  the  pres¬ 
sure,  substitute  equation  (1)  into  the  Navier-Stokes  equations  and  subtract  the  base-flow 
equations.  The  resulting  unsteady,  nonlinear  disturbance  equations  are 


Q  1 

^  +(u-V)u  +  iU-  V)u  -I-  (u  .  V)U  =  -  Vp  +  ^  (2) 

ut 


and  the  continuity  equation  is 


V  •  u  =  0 


(3) 


Boundary  conditions  at  the  wall  and  in  the  far  field  are 


a  =  0  at  y  =  0  and  a  — >  0  as  y  — >  cx)  (4) 

The  equations  have  been  nondimensionalized  with  respect  to  the  free-stream  velocity  Uoci 
the  kinematic  viscosity  u,  and  some  length  scale  at  the  inflow  (e.g..  displacement  thickness 
S*}.  A  Reynolds  number  can  then  be  defined  as  J?*.  =  Uoo^Hu. 
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3.  NUMERICAL  TECHNIQUES 


In  the  streamwise  direction  ( j-direction),  fourth-order  central  finite  differences  are  used 
for  the  pressure  equation.  At  boundary  and  near-boundary  nodes,  fourth-order  differences 
are  used.  For  the  first  and  second  derivatives  in  the  inoinentuin  equations,  sixth-order 
compact  differences  by  Lele  (1992)  are  used.  At  the  boundary  and  near-boundary  nodes, 
explicit  fifth-order  finite  differences  are  used.  The  compact  differences  lead  to  tridiagonal 
systems,  and  the  central  finite  differences  lead  to  a  pentadiagonal  system;  both  of  these 
systems  can  be  solved  efficiently  by  an  LU- decomposition  with  the  appropriate  backward 
and  forward  substitutions. 

In  the  wall-normal  direction  (y-direction),  a  Chebyshev  series  is  used  to  approximate  the 
disturbance  at  the  Gauss-Lobatto  collocation  points.  Because  this  series  and  its  associated 
spectral  operators  are  defined  on  [-1,  1]  and  the  physical  problem  of  interest  has  either  a 
semi-infinite  [0,oo)  or  a  truncated  domain  [0,  j/max],  a  transformation  is  employed.  Studies 
of  spectral  methods  and  mapping  transformations  in  unbounded  regions  have  been  conducted 
by  Grosch  and  Orszag  (1977)  and  Boyd  (1989).  Here,  an  algebraic  mapping  is  used: 

y  _  ymax'^p(l  +  y)  _  _  (25p  -I-  ymax)y  —  ymax^p  ^ ^ ^ 

2-5p  +  ymax(l  —  y)  ymaxl-Sp  +  y) 

where  y  G  [0,yinax)  and  y  G  [—1,1];  ymax  is  the  wall-normal  distance  from  the  wall  to 
the  far-field  boundary  in  the  truncated  domain;  and  Sp  controls  the  grid  stretching  in  the 
wall-normal  direction.  The  Chebyshev  series  operators  lead  to  matrix-matrix  multiplies. 

In  the  spanwise  direction  ( 2-direction),  periodicity  is  assumed,  which  allows  for  Fourier 
series  representations.  With  the  Fourier  series,  spectral  accuracy  is  obtained  in  the  spanwise 
direction,  and  fast  Fourier  transforms  (FFT),  or  sine  and  cosine  transforms,  may  be  used  for 
fast  computation  of  derivatives.  The  general  Fourier  series  leads  to  FFT  operations,  and  the 
sine  and  cosine  series  lead  to  matrix-matrix  multiply  operations.  For  more  details  on  the 
spectral  methods  used  here,  refer  to  Canuto  et  al.  (1988). 
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For  time  marching,  a  time-splitting  procedure  was  used  with  implicit  second-order 
Crank-Nicolson  differencing  for  normal  diffusion  terms;  an  explicit  third-order  three-stage 
Runge-Kutta  (RK)  method  was  used  for  the  remaining  terms.  This  time-stepping  procedure 
was  used  successfully  by  Streett  and  Hussaini  (1991)  for  Taylor-Cou^tte  flow  simulations. 
The  pressure  is  omitted  from  the  momentum  equations  (2)  for  the  fractional  RK  stage, 
which  leads  to 

^  +  (m*  •  V)u*  (U  ■  V)u*  +  (u*  •  V)U  =  (6a) 

(ft  lis^ 

with  boundary  conditions 

u*  =  u*  (66) 

Time  is  advanced  from  u"*  to  the  intermediate  disturbance  velocities  u*,  and  u*  are  inter¬ 
mediate  boundary  conditions  that  will  be  explained  later  in  this  section. 

A  full  RK  stage  is  completed  by  advancing  the  solution  in  time  from  u*  to  by 

^  (7a) 

and 

V  ■  =  0  (76) 

By  taking  the  divergence  of  equation  (7)  and  imposing  zero  divergence  of  the  flow  field  at 
each  RK  stage,  a  pressure  equation  is  obtained 

=  (8) 

which  is  subject  to  homogeneous  Neumann  boundary  conditions;  are  time-step  sizes  in 
the  RK  scheme.  This  boundary  condition  is  justified  in  the  context  of  a  time-splitting  scheme 
as  discussed  by  Streett  and  Hussaini  (1991). 

Because  the  pressure  equation  (8)  is  an  inviscid  calculation  and  involves  boundary  con¬ 
ditions  on  the  normal  component  of  velocity  only,  a  nonzero  tangential  velocity  component 
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may  arise  at  the  computational  boundary  at  the  end  of  each  full  RK  time-step,  which  is 
referred  to  as  a  “slip  velocity.”  To  correct  this  problem,  intermediate  boundary  conditions 
as  described  by  Streett  and  Hussaini  (1991)  and  Joslin  et  al.  (1992)  are  used,  which  are 
given  by 


iC  =  Mac  +  hT 


h"*  h”' 

/z  I  tl  ^ 


(9) 


where  =  U  for  a  rigid  wall  and  for  an  inflow  or  a  wall  slot  condition,  evaluated 

at  the  appropriate  time  in  the  RK  stage. 

The  solution  procedure  follows:  The  intermediate  RK  velocities  a*  are  determined  by 
solving  equation  (6).  The  pressure  p"*"*"'  is  found  by  solving  equation  (8).  Then,  the  full 
RK  stage  velocities  a"*'*''  are  obtained  from  equation  (7).  After  the  above  system  is  solved 
three  consecutive  times,  full  time-step  velocities  result.  The  three-stage  RK  time  steps  given 
by  Williamson  (1980)  are  =  {1/3,5/12,  l/4}/i(,  where  the  sum  of  the  three  RK 

time  stages  equals  the  full  time  step  (/i<). 

To  obtain  the  pressure  p  for  the  two-  and  three-dimensional  boundary-layer  problems, 
solutions  of  the  Poisson  equation  (8)  for  each  RK  stage  arc  required.  For  thrtc-diinensional 
simulations  with  spanwise  periodicity  assumed,  the  pressure  is  determined  in  transform  space, 
where  the  Fourier  coefficients  are  evaluated. 

For  the  two-dimensional  problem  and  the  zero-wave-number  component  of  the  three- 
dimensional  problem,  the  Poisson  equation  with  Neumann  boundary  conditions  is  equiva¬ 
lent  to  the  composite  solution  of  a  Poisson  problem  and  ^  Laplace  problem  with  Dirichlet 
boundary  conditions  (Streett  and  Hussaini,  1991): 


=  Eo  in  r  //  =  0  on  ^F 


(lOu) 


V^p‘'  =  0  in  r  P^'=Iyl  pi,  on  dY  (10/)) 

where  p^^  are  gradients  normal  to  the  boundaries,  Ro  is  the  known  zero-wav(’-numb('r  Fouih'r 
coefficient  of  the  right  side  of  the  jjiessure  equation  (8),  and  lyy  is  the  influence  matrix. 
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This  boundary  condition  gives  the  influence  of  the  right  side  Rg  on  the  boundary.  The 
final  solution  p  ~  satisfies  the  original  problem  and  the  boundary  conditions.  The 

influence-matrix  te  •hni([ue  is  also  used  for  the  pressure  solver  to  ensurt'  that  the  continuity 
equation  is  discretely  satisfied.  Details  of  the  influence-matrix  technique  are  given  by  Streett 
and  Hussaini  (1991),  Danabasoglu  et  al.  (1991),  and  Joslin  et  al.  (1992).  To  form  the 
influence  matrix,  a  sequence  of  solutions  is  first  determined  for  a  problem 

V^p'  =  0  in  r  p‘  =  6,j  on  dl  (11) 

for  each  discrete  boundary  point  {Xj).  The  6,j  is  the  Dirac  Delta  function  defined  as  = 
1  for  i  =  j  and  =  0  for  i  ^  j .  After  the  vector  of  normal  gradients  pj,  is  computed  at  all 
of  the  boundary  points,  these  vectors  are  stored  m  columns  to  yield  a  matrix  referred  to  as 
the  influence  matrix 

f.VF  =  [Pn'Pn’ •  •  •  'Pn  (12) 

where  Nb  is  the  number  of  boundary  points  minus  the  corner  points.  The  composed  influence 
matrix  gives  the  residuals  of  p  as  a  result  of  the  unit  boundary  condition  influence.  The  value 
of  one  boundary  condition  is  temporarily  relaxed  so  that  the  problem  is  not  overspecified. 
This  relaxation  is  accomplished  by  setting  one  column  of  the  influence  matrix  to  zero,  except 
for  the  boundary  point  of  interest,  which  is  set  to  unity.  The  corresponding  residual  is  exactly 
zeroed. 

Because  the  gradient,  or  boundary  condition,  at  one  discrete  boundary  point  was  relaxed 
in  the  influence-matrix  fornudation,  the  desired  condition  (pn  =0)  may  not  hold  at  that 
boundary  point.  The  desired  condition  may  not  hold  because  the  discrete  compatibility 
relation  may  not  lie  true  for  the  pur<’  Xeumann  problem.  To  regain  this  boundary  conddion. 
the  pressure  problem  (10)  is  solved  again:  this  time  a  nonzero  constant  (c.g.,  0.01)  is  added 
to  the  right  side  of  equation  (10a).  A  pressure  correction  p  results.  The  composite  solution 
satisfies  the  boundary  conditions  at  all  discrete  nodes  and  consists  of  a  linear  combination 


9 


of  p  aiul  p.  This  conibiiiation  is  found  by  sa':isfyiug  the  equations 

o^iPn  +  «2P„  — ■  0  on  5r,  and  ui  -j- 02  =  1  (13) 

The  final  pressure  {p‘  is  then  given  by 

+ (1  -  (Xi)p  with  ax  r^p,J{p,^- pn)  (14) 

In  transform  spare,  Poisson  efpuitions  are  solved  for  the  zero- wave -number  conq)onent; 
for  the  remaining  wave  numbers,  Helmholtz  equations  are  solved.  To  solve  the  equations 
efficiently,  a  fast  elliptic  solver  is  required.  For  this  purpose,  the  tensor- product  method 
described  by  Lynch  et  al.  (1964)  is  used.  On  a  nonstaggered  grid,  this  approach  was 
employed  by  Danabasoglu  et  al.  (1990,  1991)  for  the  channel  problem  and  by  Joslin  et  al. 
(1992)  for  the  boundary-layer  problem. 

The  discretized  equations  become 

[Hn  -  f3l)Pn  +  PnX^  =  Rn  (15) 

where  p„  are  nonzero  Fourier  coefficients  of  the  desired  pressure  solution  p,  H„  are  the  wall- 
iK)rmal  derivative  operators,  is  the  transpose  of  the  streamwise  central  finite-difference 
operator,  =  n/?  are  spanwise  wave-number  coefficients  of  the  Fourier  series,  and  Rn  are 
kn(jwn  Fourier  coefficients  of  the  right  side  of  the  pressure  equation  (8). 

The  following  matrix  operations  determine  the  wall-normal  operators  Hn  ■ 

=  and  (16) 

where  LI  is  a  spectral  wall-ncrmal  derivative  operator  for  the  stretched  grid,  D  is  the  deriva¬ 
tive  matrix  with  the  first  and  last  row:  set  to  zero,  and  D’  is  the  spectral  wall-normal  second 
derivative  operator  with  the  boundary  cond’tions  contained  in  the  first  and  last  rows.  This 
modification  enforces  the  homogeneous  Neumann  boundary  conditions  recpiired  for  p  on  the 
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Gauss-Lobatto  grid.  The  zero- wave- number  equation  is  solved  on  Gauss-Lobatto  points; 
all  other  wave-number  equations  are  solved  on  Gauss  points.  The  interpolation  matrix 
operates  on  variables  at  Gauss-Lobatto  points  and  transforms  them  to  Gauss  points;  the 
interpolation  matrix  1°^-  performs  the  inverse  operation.  These  operators  are  defined  by 
Joslin  et  al.  (1993a). 

The  operator  Hn  is  decomposed  into 


Hn  =  QAQ-'  (17) 

where  A  is  a  diagonal  matrix  of  eigenvalues  and  Q  is  the  corresponding  matrix  of  eigenvectors. 
Temporary  matrices  are  introduced  and  defined 

Pn=Q~^Pn  and  Gn=Q~^Rn  (18) 

Equations  (17)  and  (18)  are  substituted  into  equation  (15)  to  obtain 

Apn+PnX^  =  Gn  (19) 

Because  is  a  fourth-order  accurate,  pentadiagonal  matrix,  the  LU-decomposition  method 

is  used  and  provides  an  efficient  method  to  solve  the  pressure  equation  (19).  Equation  (19) 
is  used  to  solve  for  pm  which  is  then  used  in  equation  (18)  to  solve  for  pn- 

The  operators  i?„,  the  eigenvalue  and  eigenvector  matrices  Q,Q~\A,  the  penta¬ 
diagonal  operator  X^,  and  the  influence  matrix  Jjvf  are  all  mesh-dependent  matrices  and 
need  to  be  calculated  only  once. 

Here,  disturbances  are  introduced  into  the  boundary  layer  by  forcing  at  the  inflow 
boundary;  however,  the  swept-wing  problem  of  Joslin  et  al.  (1993b)  used  surface  suction 
and  blowing  to  introduce  disturbances.  At  the  outflow,  the  buffer-domain  technique  of 
Streett  and  Macaraeg  (1989)  is  used. 
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4.  PARALLEL  IMPLEMENTATION  OF  THE  SPATIAL  DNS 


In  this  section,  the  various  clata-distribution  options  available  for  iinplenientation  in  the 
three-dimensional  DNS  code  on  a  parallel  machine  are  discussed,  and  the  data  distribution 
used  in  this  application  is  outlined. 

4.1  Data  Mapping 

The  DNS  code  consists  of  a  number  of  computationally  intensive  kc'inels.  Depi'iident 
upon  tlu'  data  maiiping.  some  of  tlu'si'  kernels  are  executed  locally  on  a  siugh'  proci'ssor. 
and  the  rest  me  executed  globally  across  tht'  processors.  The  kernels  that  are  executed  lo¬ 
cally  do  not  recpiire  communication  between  processors;  kernels  that  are  executi'd  globally 
recpiire  communication.  The  major  computationally  intensive  kernels  are  the  matrix-matrix 
multiplication,  the  FFT,  the  tridiagonal  solver,  and  the  peutadiagonal  solver.  Tlu'  operation 
counts  that  correspond  to  the  kernels  are  illustrated  in  table  1;  these  op('ratiou  counts  are 
for  a  one-time  it('ration  of  the  DNS  code.  Of  these  major  kernels,  the  matrix-matrix  nmlti- 
})licatiou  is  the  most  computationally  intensive  kernel.  Hereafter,  tip  denotes  the  numbcT  of 
processors  on  a  parallel  machine,  and  n^.,  n,j.  and  n-  denote  the  number  of  data  items  (i.e.. 
grid  points)  in  the  streamwise.  wall-normal,  and  spanwise  directions. 

F<n’  the  three-dimensional  problem,  three  major  data  mappings  exist: 
x-mapping.  TIk’  three-dinu'iisional  data  are  partitioned  into  rij.  two-dimensional  planes 
of  Hyii:  data  items  each.  The  first  Uj  /tip  planes  are  mapped  to  processor  po-  Hie  next  ti^  /tip 
planes  are  mapped  to  processor  p\ ,  and  so  on. 

y-mapping.  The  three-dimensional  data  are  partitioned  into  iiy  two-dimensional  planes 
of  UxU:  data  items  each.  The  first  nylup  planes  are  majiped  to  processor  po-  Hie  next  nyjnp 
planes  are  mapped  to  processor  p\ .  and  so  on. 

z-mapping.  The  three-dimensional  data  are  partitioned  into  n  -  two-dimensional  jilanes 
of  Uj-Uy  data  items  each.  The  first  n-jup  planes  are  mapped  to  processor  po.  the  next  n-Jup 
planes  are  mapped  to  processor  pi,  and  so  on.  An  example  of  this  mapping  is  shown  in  Fig. 
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2  for  Up  =  4. 

As  stated  earlier,  the  data  mapping  determines  whether  a  particular  kernel  is  to  be 
executed  across  all  processors  or  executed  locally  on  a  single  processor.  Table  2  shows  the 
executions  of  the  major  kernels  for  the  three  data  mappings.  For  the  x-mapping,  a  great  deal 
of  communication  is  clearly  required,  which  is  undesirable.  Both  the  y-  and  2-mapping  are 
more  desirable  than  the  x-mapping  because  most  of  the  kernels  are  executed  locedly.  Because 
the  operation  counts  shown  in  table  1  indicate  that  the  matrix-matrix  multiply  is  higher  than 
FFT’s,  the  2-mapping  should  be  more  efficient  than  the  y-mapping.  The  2-mapping  requires 
that  one  kernel  (FFT)  be  executed  globally.  Our  implementation  of  the  PSDNS  code  on  the 
Intel  iPSC/860  is  based  on  the  2-mapping. 

4.2  FFT  Implementation 

The  FFT  kernel  computes  sequences  of  discrete  Fourier  transforms  of  size  n^.  The 
2-mapping  distributes  sequences  across  all  processors  of  the  machine.  One  way  to  compute 
the  discrete  Fourier  transform  of  these  sequences  is  as  follows:  first,  the  transpose  is  taken 
of  the  three-dimensional  data  (the  resulting  distribution  is  the  x-mapping  of  u),  the  one¬ 
dimensional  FFT  algorithm  is  then  executed  on  sequences  of  length  Uj  on  each  processor,  the 
results  are  multiplied  by  a  coefficient  array,  the  inverse  Fourier  transform  of  the  data  is  then 
computed,  and.  finally,  the  results  are  transposed  back  to  the  original  data  distribution.  In 
this  scheme,  all  global  data  movement  occurs  in  the  transpose  step.  To  keep  communication 
overheads  to  a  minimum,  the  transpose  operation  must  be  implemented  efficiently.  The 
transpose  operation  is  a  complete  exchange  operation;  every  node  has  an  equivalent  amount 
of  data  to  exchange  with  every  other  node.  The  xor  algorithm  is  used  to  implement  this 
exchange  procedure  because  it  is  the  optimal  procedure  on  the  Intel  iPSC/SGO.  The  xor 
algorithm,  which  is  illustrated  in  table  3,  schedules  various  exchanges  at  particular  nodes  to 
avoid  link  contention.  In  this  scheme  at  the  zth  step,  a  node  j  sends  data  to  node  iX j . 
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5.  PSDNS  VALIDATION 

The  evolution  of  a  Tollmien-Schlichting  wave  in  the  three-cliniensional  flow  is  list'd  to 
validate  the  PSDNS  approach.  A  disturbance  with  an  initial  arnplitude  A"  q  =  1  x  10“*^  is 
introduced  into  the  boundary  layer  by  a  forcing  at  the  inflow  for  the  PSDNS.  The  inflow 
disturbance  profiles  are  obtained  with  linear  stability  theory  (LST).  The  parallel-flow  as¬ 
sumption  is  used  for  comparison  with  LST.  Calculations  are  made  with  an  inflow  Reynolds 
number  =  900  and  frequency  uj  =  0.0774.  The  PSDNS  was  computed  on  a  grid  of  200 
uniformly  spaced  streamwise  nodes  (60  nodes  per  disturbance  wavelength),  61  wall-normal 
collocation  points,  and  8  spanwise  nodes.  The  outflow  boundary  is  121^*  from  the  inflow 
boundary,  the  far-field  (or  free-stream)  boundary  is  756*  from  the  wall,  and  the  wall-normal 
grid-stretching  parameter  Sp  is  equal  to  10.  For  the  time-marching  scheme,  the  distuibance 
period  is  divided  into  320  time  steps. 

Figure  3  shows  the  streamwise  evolution  of  the  computed  streamwise  (u)  and  wall- 
normal  (e)  velocity  components  of  PSDNS  compared  with  LST.  Very  good  agreement  in 
arnplitude  and  phase  are  found  at  every  spanwise  location  in  the  physical  domain.  (Note, 
that  the  buffer-domain  region  is  nonphysical  and  that  the  DNS  and  LST  results  are  not 
expected  to  agree.) 

6.  PERFORMANCE  OF  PSDNS  ON  INTEL  IPSC/860 

Although  direct  simulations  of  transition  involve  large  computational  grids  and  many 
thousands  of  time  steps  per  simulation,  the  performance  of  the  PSDNS  code  can  be  suffi¬ 
ciently  examined  for  a  single  time  step  on  smaller  grids.  The  cost  and  feasibility  of  a  full-scale 
simulation  can  be  estimated  by  using  scaling  information. 

The  range  of  parameters  is  limited  to  the  capability  of  the  machine.  The  Intel  iPSC/860 
hypercube  at  NASA  Langley  Research  Center  has  32  processors,  each  with  8  megabytes  of 
memory.  Because  the  single  precision  is  limited  to  32-bit  words  and  because  simulations  of 
transition  require  the  computations  of  small-scale  phenomena,  all  performance  test  cases  are 
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double- precision  (64-bit  words)  computations. 

The  first  sequence  of  performance  simulations  is  computed  on  a  grid  of  64  streamwise 
points  and  41  wall-normal  points.  Figure  4  shows  both  the  computational  cost  (total  cost 
minus  communication)  and  the  communication  cost  for  each  processor  in  CPU  sec  for  a 
variation  in  the  spanwise  grid  and  the  number  of  processors.  For  a  given  computational  grid, 
a  decrease  in  both  the  computational  and  communication  cost  is  achieved  by  increasing  the 
number  of  processors.  For  this  parallel  implementation,  the  communication  cost  does  not 
exceed  6  percent  of  the  total  cost  for  all  grids  and  variations  in  the  number  of  processors 
that  were  considered. 

Figure  5  shows  the  relative  cost  of  the  major  numerical  techniques  and  shows  the  speedup 
of  each  technique  with  the  number  of  processors.  As  expected,  the  computational  cost  break¬ 
down  indicates  that  the  majority  of  the  time  is  spent  on  matrix-matrix  multiply  operations 
(table  1).  The  results  suggest  that  a  negligible  change  occurs  in  the  cost  contributions  from 
each  numerical  technique  with  an  increase  in  the  number  of  processors.  The  speedup  of  each 
numerical  technique  as  the  number  of  processors  increases  indicates  a  nearly  ideal  linear 
speedup  for  the  matrix-matrix  multiply  and  the  tridiagonal  solver.  Because  matrix-matrix 
multiplies  account  for  nearly  80  percent  of  the  total  computational  cost  and  because  the 
speedup  for  the  matrix-matrix  multiplies  are  nearly  ideal,  the  total  speedup  approaches  a 
nearly  linear  rate.  For  example,  the  theoretically  ideal  speedup  rate  is  4  with  an  increase 
from  8  to  32  processors  and  the  speedup  of  3.4  was  realized  in  the  total  computational  cost. 

Figure  6  shows  the  computational  cost  and  the  slowdown  for  a  spanwise  grid  refinement 
with  8  processors.  Matrix-matrix  multiplies  dominate  the  PSDNS  cost;  matrix-matrix  mul¬ 
tiplies  and  tridiagonal  and  pentadiagonal  solvers  slow  down  at  a  faster  rate  than  other  major 
numerical  techniques.  For  example,  the  matrix-matrix  multiply  has  a  theoretically  ideal  rate 
of  slowdown  of  8  when  Uj  is  increased  from  8  to  64.  The  FFT  adds  the  least  additional  cost 
as  the  spanwise  grid  is  refined  from  8  to  64;  the  FFT  results  in  only  an  in  ’r^ase  of  a  factor  of 
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3.3  in  tlu'  computational  cost.  (The  ideal  slowdown  rate  is  8,  which  is  undesirable  because 
of  the  cost  increase.) 

The  relative  balance  in  work  load  between  respective  processors  is  an  important  element 
in  documenting  the  PSDNS  performance.  Figure  7  shows  the  computational  and  communi¬ 
cation  cost  for  each  stage  of  the  three-stage  Runge-Kutta  time  step  for  each  processor  of  an 
8-processor  simulation  on  a  spanwise  grid  of  =  8.  In  this  figure,  the  AIMS  performance 
software  shows  the  work  load  for  each  processor  in  a  separate  display  area.  The  lines  that 
connect  the  processor  areas  indicate  global  communication;  the  shaded  areas  indicate  the 
computational  work;  the  blank  spaces  between  shaded  areas  indicate  idle  times.  The  results 
show  that  all  processors  are  load  balanced,  with  the  exception  of  the  first  node  (node  =  0). 
For  the  present  combination  of  numerical  techniques  and  parallel  implementation,  this  is  the 
best  Lad  balance  that  can  be  expected.  Because  of  the  influence-matrix  pressure  solver  that 
is  used  on  the  first  node  only,  additional  work  is  always  required  on  this  node.  The  idle  time 
amounts  to  about  20  percent  of  the  total  cost  for  a  single  time-step  advancement. 

In  the  second  series  of  simulations,  the  matrix-matrix  multiply  is  optimized  because  it 
requires  over  80  percent  of  the  total  computational  cost  of  the  PSDNS  on  the  hypercube 
(see  Fig.  5).  For  a  more  efficient  code,  the  mo -r  ic-matrix  multiply  routine  would  have  to  be 
improved  (if  possible).  Instead  of  a  standard  Fortran  routme,  the  library  routine  DGEMM 
is  used  here  to  attempt  optimization.  The  performance  simulations  are  repeated. 

Figure  8  shows  the  computational  and  communication  costs  with  a  spanwise  grid  re¬ 
finement  and  an  increase  in  the  number  of  processors.  In  comparison  with  Fig.  4,  the  trend 
of  reduced  cost  as  the  number  of  processors  is  increased  remains  the  same;  hcrwever.  the 
relative  communication  cost  has  become  significant.  Aitnc-i.^.x  the  quantitative  cost  of  com¬ 
munication  is  the  same  as  before  the  new  matrix-matrix  multiply  routine  was  introduced,  the 
communication  now  equals  20  to  30  i)ercent  of  the  total  cost  because  the  new  matrix-matrix 
multiply  routine  has  reduced  the  total  computational  cost  l)y  a  factor  of  4  to  5  in  c<jmparison 
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with  the  original  code  implementation. 

Figure  9  shows  the  relative  cost  of  the  numerical  techniques  and  the  speedup  of  each 
technique  with  the  number  of  processors.  With  the  new  matrix-matrix  multiply  routine, 
the  major  numerical  techniques  are  more  balanced  in  terms  of  relative  work  load.  The 
communication  and  FFT  have  comparable  relative  cost  and  now  dominate  over  the  other 
major  numerical  techniques.  The  matrix-matrix  multiply  and  the  tridiagonal  solver  both 
have  nearly  ideal  speedups  as  before;  however,  the  total  speedup  is  only  about  half  of  the 
ideal  rate.  This  decrease  in  efficiency  occurs  because  the  FFT  routine  is  now  the  dominant 
numerical  technique  and  the  FFT  rate  of  speedup  is  not  ideal. 

Figure  10  shows  the  numerical  cost  and  the  slowdown  for  a  spanwise  grid  refinement  with 
eight  processors.  The  numerical  techniques  and  communication  cost  are  balanced;  however, 
communication  becomes  dominant  as  the  spanwise  grid  is  refined.  The  slowdown  rate  as 
spanwise  grid  is  increased  are  the  same  as  those  for  the  initial  comparison  (Fig.  6);  however, 
the  total  slowdown  rate  has  decreased  significantly.  Because  the  FFT  routine  is  the  dominant 
numerical  technique  and  because  the  FFT  slowdown  rate  is  small,  the  total  rate  of  slowdown 
closely  follows  the  FFT  rate.  This  result  is  advantageous;  it  indicates  that  little  additional 
cost  will  result  from  refinement  of  the  grid  in  the  spanwise  direction. 

The  present  multiprocessor  implementation  of  the  PSDNS  can  be  evaluated  by  varying 
the  number  of  x  —  y  planes  on  each  processor.  For  example,  the  use  of  eight  spanwise  grid 
points  on  an  eight-processor  simulation  indicates  that  each  processor  performs  computations 
on  a  single  x  —  y  plane;  eight  spanwise  grid  points  on  a  four  processor  simulation  indicates 
that  each  processor  performs  computations  on  two  .r  —  y  planes.  Figure  11  shows  the  cost 
for  a  single  x  —  y  plane  on  each  processor  compared  with  the  cost  of  four  x  —  y  planes  on 
each  processor.  Because  FFT  and  global  communication  costs  increase,  cost  increases  are 
incurred  in  both  cases  with  increased  number  of  processors.  Although  the  FFT  technique 
and  the  global  communication  are  both  effected  by  changes  in  the  number  of  processors  for  a 
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fixed  mimher  of  .c  -  /j  planes  on  each  processor,  the  costs  are  constant  for  all  other  techni<ju<'s 
(Fig.  11).  For  four  J'  —  y  planes  on  each  processor,  note  the  initial  drop  in  the  coiui)ntatioiial 
cost  from  two  to  four  processors. 

In  the  remainder  of  this  section,  the  streamwise  and  wall-normal  grids  will  b<-  refined 
to  determine  the  cost  scalings.  The  relative  cost  of  the  major  numerical  teclmicpu's  and 
the  speedup  of  each  technitpie  with  the  number  of  processors  are  shown  in  Fig.  12.  with  a 
streamwise  grid  refinement  from  /ij-  =  64  to  tt^  =  128.  Decreases  in  both  the  computational 
and  communication  costs  are  achieved  by  increasing  the  number  of  processors.  The  relative 
communication  cost  accounts  for  10  to  30  percent  of  the  total  computational  cost. 

Figure  13  shows  the  relative  numerical  cost  breakdown  for  the  dominant  kernels.  The 
numerical  techniques  and  the  communication  are  balanced;  the  FFT  routine  and  commu¬ 
nication  dominate  the  cost.  Similar  to  the  speedup  rates  from  the  earlier  case,  the  total 
rate  of  speedup  (as  the  number  of  processors  increases)  is  slower  than  the  ideal  speedup  rate 
because  the  dominant  FFT  routine  and  the  communication  have  speedup  rates  that  are  less 
than  ideal. 

Figure  14  shows  the  computational  and  communication  costs  with  a  spanwise  grid  vari¬ 
ation.  The  results  show  trends  similar  to  the  =  64  grid  shown  in  Fig.  9.  If  the  spanwise 
grid  is  refined  by  a  factor  of  four,  then  a  factor-of-three  increase  in  the  total  computational 
cost  results. 

As  in  Fig.  11,  Fig.  15  shows  the  cost  of  the  PSDNS  with  a  single  x  —  y  plane  on  each 
processor  and  with  four  .r  —  y  planes  per  processor.  For  the  single  x  ~  y  plane  on  each 
processor,  the  cost  increases  with  the  number  of  processors  in  use;  however,  the  case  with 
four  X  —  y  planes  on  each  processor  results  in  a  decrease  in  the  total  computational  cost. 
This  result  is  encouraging  for  the  performance  of  large-scale  simulations  with  a  large  number 
of  processors. 

Figures  6,  10,  and  14  show  the  effect  of  spanwise  grid  refinements  on  the  computational 
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and  communication  costs.  Figure  16  shows  that  streamwise  (x)  refinements  lead  to  nearly 
linear  theoretical  increases  in  cost;  Fig.  17  shows  that  wall-normal  (y)  refinements  also  lead 
to  nearly  ideal  linear  increases  in  computational  cost.  Although  the  results  in  Fig.  16  show 
that  the  matrix- matrix  multiplies  scale  like  n^,  the  FFT  and  communication  dominate  the 
cost  leading  to  the  total  cost  scaling  like  the  FFT  rate. 

7.  DISCUSSION 

The  present  performance  data  for  the  PSDNS  suggest  that  insufficient  core  memory  is  a 
limitation  of  the  hypercube.  The  largest  grid  that  fits  on  a  single  node  has  128  streamwise, 
41  wall-normal,  and  4  spanwise  points  (21,000  total  grid  points).  An  attempt  to  perform 
computations  with  8  spanwise  planes  of  this  same  x  —  y  grid  failed  because  of  insufficient 
memory.  Because  the  code  requires  about  160  bytes  per  grid  point,  the  21,000-point  grid 
used  about  3.4  megabytes  of  memory;  the  failed  grid  required  6.8  megabytes  (plus  operating 
system).  The  largest  grid  that  could  potentially  be  used  for  the  present  PSDNS  code  has  less 
than  42,000  grid  points  on  each  processor.  With  the  code  optimized  for  the  matrix-matrix 
multiplies,  the  total  cost  for  a  single  time  step  varied  significantly  with  the  grid  and  the 
number  of  processors  in  use.  To  determine  if  simulations  of  transition  can  be  undertaken, 
the  grid  requirements  must  be  specified  to  estimate  the  computational  cost  requirements. 
The  grid  resolution  is  highly  dependent  on  the  problem  and  the  numerical  techniques.  To 
estimate  the  feasibility  of  using  PSDNS  on  the  hypercube,  a  sample  transition  problem 
computed  on  a  single  processor  of  a  Cray-2  supercomputer  is  used  for  comparison. 

In  a  recent  study,  Joslin  and  Streett  (1993b)  computed  the  nonlinear  evolution  of  a 
crossflow  vortex  packet  on  a  swept  wing  with  spatial  DNS  on  the  Cray-2.  The  cost  of  this 
computation  amounted  to  approximately  125  CPU  hr  with  a  single  processor.  The  grid 
contained  901  chordwise,  61  wall-normal,  and  32  spanwise  points  (1.76  million  total  points) 
which  required  36.6  megabytes  of  core  memory.  The  unsteady  computation  required  9500 
time  steps  in  the  time-marching  scheme  to  reach  the  nonlinear  inflectional  velocity  profile 
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stage,  which  occurs  just  prior  to  the  lainiiiar-to-turbulent  transition. 


Each  x-y  plane  of  the  Joslin  and  Streett  ( 1993b)  study  contained  55.000  grid  points  (8.8 
megabytes  of  memory),  which  is  beyond  the  capability  of  the  present  hypercube  (8  megabytes 
per  processor).  In  this  case,  the  feasibility  of  using  PSDNS  has  been  easily  determined  by 
examining  the  memory  limitation  alone.  However,  if  16  or  32  megabytes  of  memory  per 
processor  were  available  transition  studies  could  potentially  be  conducted  on  the  hypercube. 
Then  the  feasibility  of  using  this  parallel  computer  would  rest  on  the  computational  cost  of 
such  a  simulation. 

The  temporal  cost  can  be  determined  based  on  the  previous  performance  results  of  a 
single  time  step.  From  data  in  Figs.  16  and  17,  the  rates  at  which  computational  cost 
increases  with  streamwise  and  wall- normal  grid  refinements  can  be  determined  at  1.95  and 
2.15,  respectively.  The  performance  results  indicate  that  a  single  time  step  on  a  grid  of  64 
streamwise,  41  wall-normal,  and  32  spanwise  points  distributed  on  32  processors  will  cost  3.7 
sec  for  each  processor.  With  the  scaling  rates,  the  computation  by  Joslin  and  Streett  (1993b) 
performed  on  the  hypercube  is  estimated  to  cost  77  sec  for  each  processor  per  time  step.  This 
results  in  a  total  cost  of  206  hr  for  each  processor  to  achieve  the  nonlinear  inflectional  velocity 
profile  state  described  by  Joslin  and  Streett  ( 1993b).  With  the  dedicated  use  of  a  32-processor 
hypercube  with  16  megabytes  of  memory  per  processor,  a  simulation  could  be  completed 
in  approximately  9  days,  which  is  nearly  twice  the  cost  of  using  a  supercomputer.  This 
comparison  is  a  rough  estimate  of  the  total  computational  cost  required  for  the  siimdations 
because  only  small  grids  can  be  used.  On  a  grid  with  64  streamwise,  41  wall-normal,  and 
32  spanwise  points,  the  computational  cost  of  3.0  sec  resulted  on  a  single  processor  of  a 
Cray-Y/jMP;  the  performance  was  189  megaflops.  For  the  same  grid,  the  computational  cost 
on  the  hypercube  resulted  in  3.7  sec  for  each  processor  and  roughly  153  megaflops. 

From  the  estimate,  simulations  can  apparently  be  performed  on  the  hypercube,  provided 
that  each  processor  has  at  least  16  megabytes  of  memory.  Similar  to  using  supercomputers. 
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the  hyperciibe  would  recpiire  a  number  of  days  to  complete  a  single  simidation.  To  decrease 
the  memory  and  computer-cost  requirements,  two  basic  alternatives  can  be  explored.  The 
first  alternative  is  an  increase  in  the  number  of  processors  in  use  for  a  given  grid;  the  second 
alternative  is  the  reduction  of  the  computational  grid  size  for  a  given  simulation.  However, 
by  decH'asing  the  size  of  the  grid,  the  PSDNS  will  not  resolve  the  smaller  scales  (subgrid), 
which  will  degrade  the  results.  To  capture  these  small  scales  with  an  appropriate  model,  the 
PSDNS  approach  becomes  a  large-eddy  simulation  (LES)  code,  or  PSLES.  As  discussed  and 
demonstrated  by  Piomelli  et  al.  (1990),  LES  can  reduce  the  computational  grid  and  cost  by 
an  order  of  magnitude  in  comparison  with  DNS.  If  a  subgrid-scale  mode  could  accurately 
capture  the  physics  in  the  boundary-layer  flow,  then  a  PSLES  on  a  computational  grid  of 
256  streamwise,  41  wall-normal,  and  32  spanwise  points,  distributed  on  32  processors,  could 
potentially  be  computed.  The  cost  for  a  single  time  step  would  be  14  sec  for  each  processor. 
For  the  swept-wing  problem  described  by  Joslin  and  Streett  (1993b),  the  total  computational 
cost  on  this  LES  grid  would  amount  to  37.5  hr  for  each  processor,  or  1.5  days.  Although 
the  PSLES  performance  scalings  are  slightly  underestimated  because  additional  costs  are 
involved  with  this  model,  PSLES  seems  plausible,  and  its  use  on  parallel  computers  will  be 
explored  in  the  near  future. 

8.  CONCLUDING  REMARKS 

The  performance  of  a  recently  implemented  parallel  spatial  direct  numerical  simvdation 
(PSDNS)  approach  on  the  Intel  iPSC/860  is  documented.  The  PSDNS  results  are  i.n  good 
agreement  with  linear  stability  theory  for  a  small-amplitude  test  case,  which  serves  as  the 
initial  validation  of  the  code  on  the  parallel  computer.  The  performance  results  show  nearly 
ideal  linear  speedups.  which  are  achieved  by  increasing  the  number  of  processors.  The 
computational  cost  shows  nearly  theoretical  linear  increases  with  streamwise,  walTnormal, 
and  spanwise  grid  refinements.  The  results  show  that  the  work  is  well  balanced  between!  the 
proct'ssors  (e’xcept  the  first  iKide.  which  will  have  approximately  15  to  20  percent  larger  work 
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load  because  of  the  numerical  techniquf^.s  employed).  Furthermore,  a  speedup  with  a  factor 
of  4  to  5  was  obtiiined  by  using  machine- dependent  libraries  rather  than  standard  Fortran 
routines. 

The  feasibility  of  using  the  PSDNS  on  the  hypercube  to  compute  transitional  flows  is 
assessed.  A  comparative  study  with  the  Cray  supercomputer  demonstrates  that  PSDNS 
could  be  used  for  transition  studies  on  the  hypercube,  provided  that  each  processor  had 
16  megabytes  of  memory.  Furthermore,  the  use  of  a  subgrid-scale  model  to  compute  large- 
eddy  simulations  (PSLES)  would  reduce  the  computational  cost  by  an  order  of  magnitude 
compared  with  PSDNS.  Large-eddy  simulations  could  readily  be  used  to  study  transition  on 
the  hypercube  at  a  reasonable  computational  cost. 
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Table  1.  Operation  Counts  for  the  Major  Kernels 


Kernel 

Operation  count  (oc) 

Normalization  count 

=  OcInzHyTlz 

MAT-MAT 

O(nrn^nj) 

0(^J) 

EFT 

Q{nxnynJog2nz) 

O(log2  riz) 

TRIDIAG 

O(nxnynz) 

0(1) 

PENTADIAG 

O(nxnynz) 

0(1) 

Table  2.  Major  Kernel  Executions  With  Different  Data  Mappings 


Kernel 

x-mapping 

y-mapping 

2-mapping 
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local 
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Table  3.  Illustration  of  the  xor  Scheme  for  Complete  Exchanges 
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Fig.  1.  Computational  domain  of  boundary-layer  transition  problem. 
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Fig.  2.  The  ^-mapping  of  nj  =  8  spanwise  grid  onto  four-processor  machine. 
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Fig.  3.  Amplitude  growth  with  downstream  distance  for  a  Tollmien-Schlichting 
wave  with  initial  amplitude  A°  =  1  x  10“*’.  Reynolds  number  Rf,>  =  900,  and 
frequency  u!  =  0.0774. 
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Fig.  5.  Computational-cost  breakdown  and  speedup  with  number  of  processors 
{tip)  for  initial  implementation  of  PSDNS,  where  Uj  =  64,  Uy  =  41,  and  n,  =  64. 
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Fig.  6.  Computational-cost  breakdown  and  slowdown  with  spanwise  grid  (nj)  for 
initial  implementation  of  PSDNS,  where  =  64,  Uy  =  41,  and  rip  =  8. 
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Fig.  8.  Computational  and  communication  cost  with  number  of  processors  {rip)  for 
optimized  PSDNS,  where  rix  =  64  and  riy  ~  41. 


34 


100 
80 
60 

Comp ,  % 

40 
20 
0 

5  10  15  20  25  30  35 

-O-  Total 
Comm 
-A-  FFT 
-O-  Mat-Mat 
Tridiag 

-X-  Pentadiag 


5  10  15  20  25  30  35 


Ty»8 


3  h 


I  '  1  '  I  ' 


jO 


A/'* 


_L 


_ A 

I  ■ 


Fig.  9.  Computational-cost  breakdown  and  speedup  with  number  of  processors 
(rip)  for  optimized  PSDNS,  where  Ux  —  64  and  Uy  —  41,  and  n~  =  64. 
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Fig.  10.  Computational-cost  breakdown  and  slowdown  with  spanwise  grid  (uj)  for 
optimized  PSDNS.  where  rix  =  64,  Uy  =  41,  and  rip  =  8. 
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Fig.  11.  Coinpntatioiial-cost  breakdown  for  one  (top)  and  four  (bottom)  x  —  y 
plane.s  per  processor  and  number  of  processors  (rip)  for  optimized  PSDNS,  where 
Ux  =  64  and  tiy  =  41. 
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Fig.  12.  Computational  and  communication  cost  with  number  of  processors  (rip) 
for  optimized  PSDNS,  where  =  128  and  n„  =  41. 
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Fig.  13.  Computational- cost  breakdown  and  speedup  with  number  of  processors 
(rip)  for  optimized  PSDNS,  where  rtj-  =  128,  Uy  =  41,  and  n.  =  64. 
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Fig.  14.  Computational- cost  breakdown  and  speedup  with  spanwise  grid  (uj)  for 
optimized  PSDNS,  where  =  128,  =  41,  and  rip  =  8. 
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Fig.  15.  Computational-cost  breakdown  for  one  (top)  and  four  (bottom)  x  —  y  plains 
per  processor  and  number  of  processors  for  optimized  PSDNS,  where  rij.  =  128  and 
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Fig.  16.  Computational-cost  breakdown  and  slowdown  with  streamwise  grid  refine¬ 
ment  for  optimized  PSDNS,  where  Uy  =  41,  =  32,  and  Up  =  16. 
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Fig.  17.  Computational-cost  breakdown  and  slowdown  with  streamwise  grid  refine¬ 
ment  for  optimized  PSDNS,  where  =  64,  n~  =  32,  and  Up  =  16. 
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